Provide an overview of the project goals and the motivation for it. Consider that this will be read by people who did not see your project proposal. In our project, we use the SEDA dataset that maps student achievement based on different district level covariates. The Educational Opportunity Project at Stanford University measures educational opportunity in diverse communities all over America. We loaded the file from the publicly available datasets that the SEDA project has online here We use the inner join function to join the achievement scores dataset and the covariates dataset.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
In the first part of our project, we aim to understand the variables that predict academic achievement on a district level. This section of our project has three components. First, we will use explortory data analysis and machine learning modeling to determine key predictors of academic achievment on which to focus. Next we will take a closer look at some socio-economic status, unemployment, and poverty as predictors of academic achievement, using Washington and Massachusetts as case studies. Finally, we will explore how receiving free lunch at school predicts academic success.
Data for this portion of the project is obtained from the Stanford
this still needs to be done– describing the data source
# Load achievement data (district, CS, pooled)
# data obtained from: "https://stacks.stanford.edu/file/druid:db586ns4974/seda_geodist_pool_cs_4.0.dta"
# This data set contains data on the outcome of interest, district academic achievement.
# In this data set, academic achievement is measured in terms of standard deviation
# units and is what we will be using for our regression analyses
ach <- read_dta("seda_geodist_pool_cs_4.0.dta")
# Load main covariates data (district level), pool
# data obtained from https://stacks.stanford.edu/file/druid:db586ns4974/district%20covariates.dta"
# This data set contains an extensive list of covariates that will be tested as
# predictors of our outcome, academic achievement
covs <- read_dta("seda_cov_geodist_pool_4.0.dta")
# Merge files together. This keeps only matched districts.
dat <- inner_join(ach, covs, by = c("sedalea", "fips"))
# Subset to the "all" subgroup estimates for all students, only districts
# with an estimate of average achievement, unemployment, poverty, ses avergage, and non-missing SES.
dat <- filter(dat,
subgroup == "all",
!is.na(cs_mn_avg_ol),
!is.na(unempavgall),
!is.na(povertyavgall),
!is.na(sesavgall))
nrow(dat)
## [1] 12438
After creating this data set, we noticed that the covariate data set that we used did not include variables related to district spending. This is an area of interest of our team, as it shows how policies and funding can play a round in academic outcomes. Therefore, obtained an additional SEDA dataset with district spending included and merged this with the data set above.
#importing the data set that includes variables of interest: ppexp_tot, ppexp_inst, pprev_tot
spending <- read_dta("district covariates.dta")
# Subsetting data set to only include variables of interest
# also including the district ID, titled leaid here, and state ID (fips) for joining
spending <- subset(spending, select = c(ppexp_tot, ppexp_inst, pprev_tot, leaid, fips))
# since the name of the district ID variable in the dataset "spending" (leaid) is different
# from the district ID variable in the data set "dat" (sedalea), we need to change this
spending <- rename(spending, sedalea = leaid)
#checking the class of `sedalea` in "spending" df
class(spending$sedalea)
## [1] "character"
#checking the class of `sedalea` in "dat" df
class(dat$sedalea)
## [1] "numeric"
#converting `sedalea` in "spending" df to numeric class so that they are both of the same class
spending$sedalea <- as.numeric(spending$sedalea)
#confirming that `sedalea` in "spending" is now numeric
class(spending$sedalea)
## [1] "numeric"
# Merge files together. This keeps only matched districts.
dat <- inner_join(dat, spending, by = c("sedalea", "fips"))
For the exploratory data analysis, we start by summarizing our outcome of interest: district academic achievement. As mentioned in the code above, the data set that we imported for our measures academic achievement in terms of standard deviation units away form the national mean. While we can compare relative performance using this measure, interpreting standard deviation units for exploratory data analysis is not too useful, and can be difficult to interpret. Therefore, we will import another SEDA dataset that measures academic achievement outcomes in terms of “grade levels” using the variable gcs_mn_avg_ol.
As with cs_mn_avg_ol, this variable indicates the average math and reading scores for grades 3 through 9 from academic years 2008-2009 to 2015-2016. This scale, however, is in “grade levles” where 5.5 is average across all districts (mid point of grades 3 and 8, the grades that participate in testing). Using this scale means that if a district has an average grade level of 7.5, their students math and reading scores are, on average, similar to students who are one grade level higher, when compared to an average district in the nation.
# loading SEDA data that includes academic achievement in "grade levels"
ach.EDA <- read_dta("seda_geodist_pool_gcs_4.0.dta")
ach.EDA <- filter(ach.EDA,
subgroup == "all",
!is.na(gcs_mn_avg_ol))
To begin our exploratory data analysis on our outcome, it would be helpful to first take a look at how much variation we have in our data. We can visualize this first by using a histogram.
ggplot(aes(x=gcs_mn_avg_ol), data = ach.EDA) +
geom_histogram(color="black", fill="#99CCFF") +
xlab("Average Test Scores (Grade Levels)") +
ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram above, we see that there is decent amount of variation in the district average achievement scores, as measured in “Grade Levels.” For example, some districts have average test scores of around 2.5– that means that district is testing 3 grade levels below the national average of 5.5! On the other hand, there are also some very high achieving districts. For example, some districts have average test scores at around 7.5 grade levels, 2.5 grade levels above the national average.
Our team is hoping to understand some of the district-level factors that leads to such variation in academic performance in the United States.
Before moving onto our analysis, we can also look at the variation in test scores between and within states using box plots:
ach.EDA %>% mutate(stateabb = reorder(stateabb, gcs_mn_avg_ol, FUN = median)) %>%
ggplot(aes(stateabb, gcs_mn_avg_ol, group = stateabb)) +
geom_boxplot(fill = "#99CCFF") +
labs(x = "State Abbreviation", y = "Average District Test Scores (Grade Levels)") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
From the graph above, we notice there is a lot of variation in test scores both between and within states. Districts in Massachusetts have the highest median test score, performing at about 1.5 grade levels above the national average while New Mexico has the lowest median achievement scores with a median of around 0.5 grade levels below the national average. With that being said, within most states there is a lot of variation in performance. Within a given state, take California for example, you can find districts with both extremely high and extremely low test scores.
In our analysis, we hope to understand the factors that contribute to such variation between districts across the United States. The covariate data set has many different variables, many of which may predict district test scores. However, for the purpose of our project, we want to focus in on a few key covariates in order to produce more focused and meaningful conclusions.
To decide our priority predictors, we can start of by visually summarizing the overall relationships between these variables and our outcome for academic achievement, cs_mn_avg_ol. In contrast our achievement variable used in the plots above, cs_mn_avg_ol measures academic achievement in terms of standard deviations away from the national average.
Before we begin, it would be helpful to summarize create a list of our predictors for this analysis. The covariates we are exploring include:
urban: if the district is in a city/urban locale (binary; 0 = no, 1 = yes)avgrdall: average per grade enrollment (number of students)perfrl: percent free lunch in the grade (percentage; range: 0-1)perell: % of all students in the district that are English Language Learners (ELL ) (percentage; range: 0-1)perspeced: % of all students in the district that qualify for Special Education (percentage; range: 0-1)nsch: number of schools in district (count)aides: number of instructional aides (count)corsup: number of instructional coordinators and supervisors (count)elmgui: number of Elementary Guidance Counselors (count)stutch_all: pupil teacher ratio (ratio of number of pupils to number of students in district)ppexp_tot: total per pupil expenditures (dollars)ppexp_inst: total per pupil expenditures on instruction (dollars)pprev_tot : revenue per pupil (dollars)baplus_all: % of adults in district with at least a bachelors degree (percentage)poverty517_all: % of households with 5-17 year olds in poverty (percentage)singmom_all % of household with children with single mom (percentage)snap_all : % of households receiving snap benefits (percentage)samehouse_all: % of households living in the same house as last year (percentage)unemp_all : % unemployed in the district (percentage)inf_fem: percent of 25 - 64 year old females in labor force (percentage)teenbirth_all: percent of 15-19 year olds giving birth in district (percentage)sesall: standardized socio-economic status composite score computed as the first principal component factor score of the following measures: median income, percent with a bachelor’s degree or higher, poverty rate, SNAP rate, single mother headed household rate, and unemployment ratecs_mn_avg_ol: District test-based achievement math and reading scores from ordinary least quares estimateNOTE TO OTHERS– IM STILL WORKING ON THIS SECTION– I CHANGED MY LIST OF PREDICTORS TO JUST INCLUDE THE ONES FROM THE “COVS” DATA SET PLUS THE EXPENDITURE ONES SO THE ABOVE LIST AND THE BELOW GRAPHS ARE NOT APPLICABLE. I WILL FIX THIS AND UPDATE WHENEVER THERES A GAP IN PEOPLE WORKING ON THE .RMD PLEASE CARRY ON :)
Oh and i will also make the scatter plots below look better dw
For the continuous predictors, we can do scatter plots:
ML.dta <- read.csv("complete_data_ML.csv")
ML.dta %>%
gather(predictor, value, c(avgrdall, perfrl, perell, perspeced, nsch, aides,
corsup, elmgui, stutch_all, ppexp_tot, ppexp_inst,
pprev_tot, baplus_all, poverty517_all, singmom_all,
snap_all, samehouse_all, unemp_all, inlf_fem,
teenbirth_all, sesall)) %>%
ggplot(aes(x = value, y = cs_mn_avg_ol)) +
geom_point() +
facet_wrap(~ predictor, scales = 'free_x') +
xlab(NULL) + ylab("District Achievement Scores")
For the binary predictor (
urban), we can use a box plot:
Our above scatter plots confirm what we initially suspected: many of these variables do in fact predict academic achievement. For example, it appears baplus_all, the percent of adults in the district with a bachelor’s degree, is positively correlated with academic achievement. On the other hand, the percent of households in poverty or with a single mother appear to be negatively correlated. For other variables, such as per pupil expenditure (ppexpt_inst), the association is less clear. Overall, while these scatterplots give us a general sense of what variables may be important, it still is not clear which variables matter the most for academic achievement.
To answer this question, we will turn to machine learning models. We will first use random forest to determine variables of importance in predicting academic achievement, as measured by the Gini coeffient. We will then addition use a regression tree to see which variables are included to predict relative test scores.
To begin our machine learning exercises, we first must great our training and test sets. Here we are using training and test sets that both include 50% of the dataset.
#creating a data partition that splits my data frame in half
index_train<- createDataPartition(y = ML.dta$low_dist_ach, times =1, p=0.05, list = FALSE)
#labeling the first "slice" as the train set
train_set <- slice(ML.dta, index_train)
#labelign the second "slice" as the test set
test_set <- slice(ML.dta, -index_train)
Next, we will use a Random Forest Model to determine variables of importance, as measured using the Gini Coefficient.
set.seed(1)
#fitting a random forest with all 22 predictors
rf_fit <- randomForest(cs_mn_avg_ol ~ urban + avgrdall + perfrl + perell + perspeced + nsch + aides + corsup + elmgui + stutch_all + ppexp_tot + ppexp_inst + pprev_tot + baplus_all + poverty517_all + singmom_all + snap_all + samehouse_all + unemp_all + inlf_fem + teenbirth_all + sesall, data = train_set, mtry = 22, importance = TRUE)
variable_importance <- importance(rf_fit)
variable_importance_death <- tibble(Predictor = rownames(variable_importance),
Gini = variable_importance[,1]) %>%
arrange(desc(Gini))
kable(variable_importance_death[1:22,])
| Predictor | Gini |
|---|---|
| perfrl | 66.9215919 |
| baplus_all | 25.9454473 |
| sesall | 19.2590927 |
| singmom_all | 11.5929623 |
| perell | 10.7889856 |
| stutch_all | 9.8462576 |
| poverty517_all | 9.6776906 |
| snap_all | 8.8213275 |
| unemp_all | 8.0199877 |
| elmgui | 7.8561654 |
| avgrdall | 7.7680300 |
| corsup | 4.8952699 |
| pprev_tot | 4.2684019 |
| inlf_fem | 4.1635516 |
| nsch | 4.0984564 |
| ppexp_inst | 3.9072944 |
| ppexp_tot | 3.3380510 |
| samehouse_all | 2.8130618 |
| aides | 2.4035525 |
| urban | 1.2816718 |
| perspeced | 0.9743020 |
| teenbirth_all | -0.1491024 |
#predicting probability of estimates for test set
probs_rf <- predict(rf_fit, newdata = test_set, type = "class")
#converting probabilities to predicted response labels
y_hat_rf <- ifelse(probs_rf > 0.5, 1, 0)
#obtaining the confusion matrix
confusionMatrix(data = as.factor(y_hat_rf),
reference = as.factor(test_set$low_dist_ach),
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9287 1705
## 1 756 1
##
## Accuracy : 0.7905
## 95% CI : (0.7831, 0.7979)
## No Information Rate : 0.8548
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0971
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 5.862e-04
## Specificity : 9.247e-01
## Pos Pred Value : 1.321e-03
## Neg Pred Value : 8.449e-01
## Prevalence : 1.452e-01
## Detection Rate : 8.511e-05
## Detection Prevalence : 6.443e-02
## Balanced Accuracy : 4.627e-01
##
## 'Positive' Class : 1
##
From the above outputs we can see two things. First, our random forest model is fairly accurate at predicting test scores with an accuracy of 80.47%. Second, as measured by the Gini coefficient, percent free lunch is the most important predictor of test scores followed by percent of adults who have their bachelors degree. Other important indicators include socio-economic status and poverty.
To supplement the above results from the random forest, we can also fit our data to a regression tree model.
tree = tree(cs_mn_avg_ol ~ urban + avgrdall + perfrl + perell + perspeced +
nsch + aides + corsup + elmgui + stutch_all + ppexp_tot +
ppexp_inst + pprev_tot + baplus_all + poverty517_all +
singmom_all + snap_all + samehouse_all + unemp_all +
inlf_fem + teenbirth_all + sesall, data = train_set)
summary(tree)
##
## Regression tree:
## tree(formula = cs_mn_avg_ol ~ urban + avgrdall + perfrl + perell +
## perspeced + nsch + aides + corsup + elmgui + stutch_all +
## ppexp_tot + ppexp_inst + pprev_tot + baplus_all + poverty517_all +
## singmom_all + snap_all + samehouse_all + unemp_all + inlf_fem +
## teenbirth_all + sesall, data = train_set)
## Variables actually used in tree construction:
## [1] "perfrl" "baplus_all" "sesall"
## Number of terminal nodes: 8
## Residual mean deviance: 0.03201 = 19.56 / 611
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.769200 -0.104000 0.003238 0.000000 0.118600 0.680600
As seen in teh above output, the regression tree only ened up including three variables in the tree construction: percent of students receiving free lunch (perfrl), percent of adults with at least a bachelors degree (baplus_all), and percent of students who are english language learners (perell).
plot(tree, type = "uniform")
text(tree, cex = 1)
Need to add tree conclusions
So based on these analyses, what predictors should we focus on? It is clear from the random forest and the regression tree models that some variables are more important for academic success than others. For example, percent of students receiving free lunch has high priority in both the random forest model and the regression tree models so warrants attention in future analyses. We additionally want to balance the outcomes of this exploratory data analysis with the interests of our team members. For example, socio-economic status and related predictors, such as poverty and unemployment, are often described as strong predictors of academic success (source?). Therefore, it would also be worthwhile to explore these variable further and the nature of their associations with academic achievement outcomes.
Therefore, we explore different covariates of interest that highly correlate with average district academic achievement, relative to the national mean measured in standard deviation units, including unemployment, free lunch, SES, and poverty. In this section we look at poverty, unemployment, and SES composite scores.
To reiterate what we was mentioned above, we use a stargazer to observe the mean and standard deviations of each variable and then review the correlation between each of the unemployment, SES scores, poverty, and average academic achievement variables. We see that SES, poverty, and unemployment are all highly correlated with one another as expected. Furthermore, we observe that all three variables are correlated with average academic achievement with SES and average test scores having a correlation of 0.76, unemployment and average academic achievement have a negative correlation of -0.57 and poverty and average academic achievement also having a negative correlation of -0.67.
dat2 <- dat %>%
group_by(stateabb) %>%
summarise(ses = mean(sesavgall), poverty = mean(povertyavgall), unemployment = mean(unempavgall))
stargazer(data.frame(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)),
type="text")
##
## ==================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ------------------------------------------------------------------
## cs_mn_avg_ol 12,366 0.021 0.334 -2.341 -0.185 0.222 1.248
## unempavgall 12,366 0.060 0.020 0.002 0.047 0.071 0.221
## povertyavgall 12,366 0.126 0.062 0.007 0.081 0.159 0.460
## sesavgall 12,366 0.333 0.846 -4.401 -0.157 0.872 2.910
## ------------------------------------------------------------------
#correlation between outcome and unemployment, poverty, and ses
round(cor(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)), 2)
## cs_mn_avg_ol unempavgall povertyavgall sesavgall
## cs_mn_avg_ol 1.00 -0.57 -0.67 0.76
## unempavgall -0.57 1.00 0.65 -0.75
## povertyavgall -0.67 0.65 1.00 -0.92
## sesavgall 0.76 -0.75 -0.92 1.00
After seeing the descriptive statistics of each variable, we plot histograms of unemployment, SES scores, poverty, and average academic achievement to display the distribution of each variable. SES is measured as “sesavgall” which is the SES composite score for all families in the district between 2005-2009 and 2014. “povertyavgall” represents the poverty rate (eb estimate) of all families in the district between 2005-2009 and 2014 and the “unempavgall” represents the unemployment rate (eb estimate) of all families in the district between 2005-2009 and 2014. We create histogram and annotate them with the means and standard deviations of each variable. We see that poverty, academic achievement, and SES all have wider spreads of variation and the distribution of unemployment seems to be more narrow.
# How much variation is there in average academic achievement?
ggplot(aes(x=cs_mn_avg_ol), data = dat) +
geom_histogram(color="black", fill="cyan2", binwidth = 0.05) +
annotate("text", x=-1.8, y=750, label = "Mean=0.021, SD=0.33",
size=5, hjust=0, vjust=0) +
xlab("Average Test Scores (Grade 5.5, CS Scale)") +
ylab("Frequency") +
theme_classic()
# How much variation is there in unemployment?
ggplot(aes(x=unempavgall), data = dat) +
geom_histogram(color="black", fill="lightsalmon1", binwidth = 0.005) +
annotate("text", x=-0.15, y=1010, label = "Mean=0.06 , SD=0.02",
size=5, hjust=0, vjust=0) +
xlab("Average Unemployment") +
ylab("Frequency") +
theme_classic()
# How much variation is there in poverty?
ggplot(aes(x=povertyavgall), data = dat) +
geom_histogram(color="black", fill="darkolivegreen3", binwidth = 0.01) +
annotate("text", x=0.21, y=600, label = "Mean = 0.126, SD=0.062",
size=5, hjust=0, vjust=0) +
xlab("Average Poverty") +
ylab("Frequency") +
theme_classic()
# How much variation is there in SES?
ggplot(aes(x=sesavgall), data = dat) +
geom_histogram(color="black", fill="blueviolet", binwidth = 0.1) +
annotate("text", x=-3.5, y=500, label = "Mean = 0.33, SD=0.85",
size=5, hjust=0, vjust=0) +
xlab("Average SES") +
ylab("Frequency") +
theme_classic()
Now we want to utilize ggplot to plot the association between the three socioeconomic factors including unemployment, SES scores, and poverty with average academic achievement. We also obtain a pearson correlation coefficient (r) for each graph and annotate it on the plot. Each point represent a district and the size of each point is the total enrollment between grades 3-8 in that district. Lastly, we draw a non-linear line that follows the trajectory of our data points. In the first plot, we see that there is a negative association between unemployment and average district academic achievement. Our r= -0.57. Similarly in the next plot, we also observe that there is a negative association between poverty and average district academic achievement with an r = -0.67. In the last plot, we see that there is actually a positive association between SES and average district academic achievement with the highest correlation coefficient of r = 0.76.
# What is the association between unemployment and average academic achievement?
r1 <- round(cor(dat$cs_mn_avg_ol, dat$unempavgall),2)
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.3, pch=21, color = "black", fill = "lightsalmon1", aes(size = totenrl)) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
annotate("text", x=0.03, y=-1.5, hjust=1, vjust=0,
label = paste0("r=",r1)) +
scale_size(range = c(1,30)) +
guides(size="none") +
ggtitle("What is the association between Unemployment and average academic achievement?") +
xlab("Average Unemployment") +
ylab("Average Test Scores (Grade 5.5, CS Scale)")
# What is the association between poverty and average academic achievement?
r2 <- round(cor(dat$cs_mn_avg_ol, dat$povertyavgall),2)
ggplot(aes(x=povertyavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.3, pch=21, color = "black", fill = "darkolivegreen3", aes(size = totenrl)) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
annotate("text", x=0.05, y=-1.5, hjust=1, vjust=0,
label = paste0("r=",r2)) +
scale_size(range = c(1,30)) +
guides(size="none") +
ggtitle("What is the association between poverty and average academic achievement?") +
xlab("Average Poverty") +
ylab("Average Test Scores (Grade 5.5, CS Scale)")
# What is the association between SES and average academic achievement?
r3 <- round(cor(dat$cs_mn_avg_ol, dat$sesavgall),2)
ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.3, pch=21, color = "black", fill = "blueviolet", aes(size = totenrl)) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
annotate("text", x=-3, y=0.75, hjust=1, vjust=0,
label = paste0("r=",r3)) +
scale_size(range = c(1,30)) +
guides(size=FALSE) +
ggtitle("What is the association between SES and average academic achievement?") +
xlab("Average SES") +
ylab("Average Test Scores (Grade 5.5, CS Scale)")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
First, we create boxplots to display the range of the median unemployment rates between all 50 states. Here, we observe that South Dakota has the lowest unemployment rate and Mississippi has one of the highest unemployment rates (excluding DC). Then, Montana seems to have similar unemployment rates compared to Massachusetts. We note this to plot four states that we can compare in terms of their unemployment-academic achievement gradient.
Next, we found that according to the Bureau of Labor Statistics, Nebraska has the lowest unemployment rate as of September 2021 at 2.0. Washington and Massachusetts have very similar unemployment rates of 4.9 and 5.2, respectively. Lastly, California has the highest unemployment rate of 7.5. Therefore, we also specifically look at the differences in the unemployment-achievement gradient between Nebraska, Washington, Massachusetts, and California. To start off, we plot the boxplots of unemployment rates for the four states and note the differences in their medians.
Then, in one of our final plots titled “How does the Unemployment-Achievement gradient differ in WA, MA, NE, and CA?”, we notice that CA has a huge range of unemployment rates with the line stretching wide into higher unemployment scores. Meanwhile, the line for Nebraska has a range that stretches into the low unemployment rates. We see that all four lines have negative slopes with higher unemployment rates with lower scores of academic achievement and lower unemployment rates with higher scores of academic achievement. However, when comparing between states, the two lines for CA and NE seem to have very similar slopes that overlap with one another. Massachusetts and Washington have similar ranges of unemployment rates but the line is overall at a higher level. Districts in MA display higher average academic achievement scores in standard deviation units (relative to the national mean) overall compared to CA, NE, and WA.
Finally, in the last plot labeled “How does the Unemployment-Achievement gradient differ in MS, MA, SD, and MT?”, we also plot the gradients for the four states of Massachusetts, Montana, South Dakota, and Mississippi specified earlier. Here, we notice that South Dakota has a wide range of unemployment rates that stretch into lower numbers. MT and SD seem to have similar slopes while MS has a slope that is closer to zero. The unemployment-achievement gradient slope seems to be highest in magnitude (most negative) for MA but again, we notice that the average academic achievement scores are higher overall at all unemployment rates compared to all other states.
We note that there are several flaws in solely using unemployment as a covariate. We believe that there could be additional variation happening that is not just coming from unemployment. For further analyses, we can look at additional covariates of interest that may impact academic achievement such as income earned, % of students receiving free lunch, poverty levels, environmental factors, housing, and social policies. We also ask why Massachusetts has significantly higher scores of academic achievement at every level of unemployment compared to several other states. Perhaps this can be due to the educational system and policies in MA and/or the different programs available to school districts in Massachusetts. Other variables that may impact the academic performance of students in each state include family environments (such as domestic violence and abuse), racism, and other economic and social measures.
# The unemployment distribution between all 50 states ordered by the median
p_unempdist <- dat %>%
mutate(stateabb = reorder(stateabb, unempavgall, FUN = median)) %>%
ggplot(aes(stateabb, unempavgall)) +
geom_boxplot(color = "darkorchid4") +
#scale_y_discrete() +
ylim(0, 0.2) +
scale_x_discrete() +
theme(axis.text.x = element_text(angle = 90, hjust = 0.5, size = 7)) +
ggtitle("Distribution of Unemployment rates for each state") +
xlab("States") +
ylab("Unemployment Rate")
p_unempdist
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
# The unemployment distribution between WA vs. MA vs. NE vs. CA
dat %>%
filter(stateabb == "WA" | stateabb == "MA" | stateabb == "NE" | stateabb == "CA") %>%
ggplot() +
geom_boxplot(aes(x= stateabb, y=unempavgall, fill=stateabb), color = "black", outlier.colour = "navy", outlier.size = 1, notch = FALSE) +
scale_y_continuous(trans = "sqrt") +
xlab("") +
ylab("Unemployment") +
theme(legend.position ="none")
# How does the unemployment gradient compare in WA vs. MA vs. NE vs. CA?
dat_wa_ma_ne_ca <- filter(dat, stateabb %in% c("WA","MA", "NE", "CA"))
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_wa_ma_ne_ca) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb), formula = y~x,
data = dat_wa_ma_ne_ca) +
scale_size(range = c(1,30)) +
guides(size="none", color="none") +
labs(fill = "State") +
ggtitle("Unemployment-achievement points in WA, MA, NE, and CA?") +
xlab("Average Unemployment") +
ylab("Average Test Scores (Grade 5.5, CS Scale)")
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_wa_ma_ne_ca, show.legend = FALSE) +
#geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
scale_size(range = c(1,30)) +
geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
data = dat_wa_ma_ne_ca) +
ggtitle("How does the Unemployment-Achievement gradient differ in WA, MA, NE, and CA?") +
xlab("Average Unemployment") +
ylab("Average Test Scores (Grade 5.5, CS Scale)") +
labs(color = "State")
# How does the unemployment gradient compare in MS vs. MA vs. SD vs. MT?
dat_ms_ma_sd_mt <- filter(dat, stateabb %in% c("MS","MA", "SD", "MT"))
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_ms_ma_sd_mt, show.legend = FALSE) +
scale_size(range = c(1,30)) +
geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
data = dat_ms_ma_sd_mt) +
ggtitle("How does the Unemployment-Achievement gradient differ in MS, MA, SD, and MT?") +
xlab("Average Unemployment") +
ylab("Average Test Scores (Grade 5.5, CS Scale)") +
labs(color = "State")
Furthermore, we chose to plot the academic achievement by SES composite scores for MA and WA to observe whether we will observe a difference in the relationship between SES and academic achievement and the relationship between unemployment and academic achievement. We see that there are some differences in the range and the gradient. By simply looking at the points, we notice that Massachusetts and Washington generally overlap in their relationships but MA does seem to have slightly higher levels of SES.
Massachusetts and Washington may have similar unemployment rates but may look different on other scales such as housing or poverty or income levels. Therefore, we want to consider other covariates, such as % of free lunch, an indicator that our previous regression tree included as a predictor. We saw that the random forest and decision tree model mentioned earlier concludes that percent of students receiving free lunch has high priority in influencing academic achievement.
# How does the SES gradient compare in WA vs. MA?
dat_wa_ma <- filter(dat, stateabb %in% c("WA","MA"))
ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
data = dat_wa_ma) +
geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb),
data = dat_wa_ma) +
scale_size(range = c(1,30)) +
guides(size=FALSE, color=FALSE) +
labs(fill = "State") +
ggtitle("How does the SES-achievement gradient differ in WA and MA?") +
xlab("Average SES") +
ylab("Average Test Scores (Grade 5.5, CS Scale)")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## `geom_smooth()` using formula 'y ~ x'
Since we are interested in school-related food-assistance policy, we chose percent free or reduced lunch in the district as our variable of interest for this analysis. While SNAP benefits influence academic achievement outcomes, the program targets individuals and households, not schools directly. Moreover, as shown before, percent free or reduced lunch was a stronger predictor of academic achievement than SNAP benefits. We will estimate the effect of percent free or reduced lunch, socioeconomic status, and geography on academic achievement by running a regression model.
#regression model no interaction
model <- lm(cs_mn_avg_ol~perfrl+urban+sesavgall, data=dat_free_lunch)
broom::tidy(model)
From our model, we can see that our main predictor (percent free or reduced lunch) as well as the other two covariates (socioeconomic status and geography) have a statistically significant relationship with the outcome variable (academic achievement) at a 5% significance level. Specifically:
perfrl: Among rural districts with average SES, a one unit increase in percent free or reduced lunch was associated with being 1.034795357 standard deviation units below the national district average academic achievement.
urban: Districts in urban areas with average SES and no percent free or reduced lunch had academic achievement rates that were 0.005108208 standard deviation units below the national district average academic achievement.
sesavgall: Among districts in rural areas and with no percent free or reduced lunch, a one unit increase in socioeconomic status was associated with being 0.075998091 standard deviation units above the national district average academic achievement.
This model assumes that that the effect of any predictor (percent free or reduced lunch, SES, geography) on the outcome (academic achievement) is independent of the other predictors in the model. However, this may not be a reasonable assumption given that percent free or reduced lunch may have a different effect on academic achievement based on different values of socioeconomic status and urban versus rural settings. Therefore, as opposed to considering the effect of socioeconomic status and geography in isolation, we could see how they interact with perfect free or reduced lunch by adding interaction terms in our model.
#regression model with interaction
model2 <- lm(cs_mn_avg_ol~perfrl+urban+sesavgall+perfrl*urban+perfrl*sesavgall, data=dat_free_lunch)
broom::tidy(model2)
From this interaction model, we can see that our interaction terms are both statistically significant. Hence, we know that the effect of percent free or reduced lunch on academic achievement is different for urban versus rural districts as well as for different values of socioeconomic status. We can show the interaction visually by graphing the relationship between percent free or reduced lunch and academic achievement by urban versus rural districts and for different values of socioeconomic status. However, to better plot and interpret the values of socioeconomic status which is currently a continuous variable, we can create categories like they do in the Stanford Education Data Archive Technical Documentation (https://stacks.stanford.edu/file/druid:db586ns4974/seda_documentation_4.0.pdf).
dat_free_lunch <- dat_free_lunch %>% mutate(ses_cat = case_when(
sesavgall < -2.5 ~"Below 2.5",
sesavgall >= -2.5 & sesavgall < -1.5 ~ "Between -2.5 & -1.5",
sesavgall >= -1.5 & sesavgall < -0.5 ~ "Between -1.5 & -0.5",
sesavgall >= -0.5 & sesavgall < 0.5 ~ "Between -0.5 & 0.5",
sesavgall >= 0.5 & sesavgall < 1.5 ~ "Between 0.5 & 1.5",
sesavgall >= 1.5 & sesavgall < 2.5 ~ "Between 1.5 & 2.5",
sesavgall >= 2.5 ~"Above 2.5"))
library(ggpubr)
F = as.formula("y~x")
lunch_plot<-dat_free_lunch %>% mutate(urban=recode(urban,`1` = "Urban", `0` = "Rural")) %>% ggplot(aes(perfrl, cs_mn_avg_ol, color=as.factor(urban))) + geom_smooth(method = "lm", se=FALSE)+labs(x="% Free or Reduced Lunch", y="Academic Achievement", title="Free/Reduced Lunch & Academic Achievement by Geography")+guides(color=guide_legend(title="Geography"))+stat_regline_equation(label.x = 0.75); lunch_plot
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 590 rows containing non-finite values (stat_smooth).
## Warning: Removed 590 rows containing non-finite values (stat_regline_equation).
ses_plot<-dat_free_lunch %>% drop_na() %>%ggplot(aes(perfrl, cs_mn_avg_ol, color=as.factor(ses_cat))) + stat_smooth(method="lm",fullrange=TRUE, se=FALSE)+labs(x="% Free or Reduced Lunch", y="Academic Achievement", title="Free/Reduced Lunch & Academic Achievement by SES")+guides(color=guide_legend(title="SES category"))+stat_regline_equation(label.x = 0.7, label.y.npc=0.94); ses_plot
## `geom_smooth()` using formula 'y ~ x'
The graphs above show that the relationship between percent free or reduced lunch and academic achievement is in fact different for districts in rural versus urban areas as well as for different values of socioeconomic status. The slope for urban areas is slightly steeper than that of rural areas meaning that the effect of percent free or reduced lunch on academic achievement in districts is greater in urban areas than rural areas. Furthermore, the relationship between percent free or reduced lunch and academic achievement changes more rapidly in urban areas than in rural areas. We can also see that the relationships are also different at the different categories of socioeconomic status, especially for districts that are at the extremes of socioeconomic status (2.5 standard deviation units above and 2.5 standard deviation units below the average national socioeconomic status). The relationship between percent free or reduced lunch and academic outcomes is positive for districts that are 2.5 standard deviation units above whereas for all other districts the relationship is negative. Additionally, it seems that the relationship is particularly strong in districts with SES that are between 1.5 and 2.5 standard deviation units above the average national socioeconomic status.test